Gradient descent#
Introduction#
When you train, for instance, a predictive model, you look for the parameter values that make its predictions as accurate as possible. The parameters could be set by random guessing or with the assistance of an optimization algorithm. One such algorithm is gradient descent. A running model comes with a way of measuring how far off its predictions are, and the goal of gradient descent is to minimize this error: the algorithm provides smart guidance for adjusting the parameters. The implementation of gradient descent optimization is described below:
Algorithm (Gradient descent)
To solve the optimization problem
do the following steps:
initialize \(\boldsymbol w\) with some random values (e.g., from \(\mathcal N(0, 1)\))
choose tolerance \(\varepsilon > 0\) and learning rate \(\eta > 0\)
while \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) do the gradient step
\[ \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w) \]
return \(\boldsymbol w\)
Fig. 1 Main idea of gradient descent algorithm#
Note
The learning rate determines the size of the steps taken during optimization and influences the convergence speed: a smaller learning rate may lead to more precise results but slower convergence.
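To illustrate this trade-off, here is a minimal sketch (not part of the original text) on the function \(f(w) = w^2\), whose gradient is \(2w\): a small learning rate converges, while an overly large one makes the iterates diverge.

```python
def gradient_step(w, lr):
    # gradient of f(w) = w**2 is 2*w
    return w - lr * 2 * w

w_small, w_large = 1.0, 1.0
for _ in range(50):
    w_small = gradient_step(w_small, 0.1)  # contraction factor |1 - 0.2| = 0.8: converges
    w_large = gradient_step(w_large, 1.1)  # factor |1 - 2.2| = 1.2: diverges

print(abs(w_small), abs(w_large))  # tiny vs. huge
```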
Note
If the condition \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) holds for too long, the loop in step 3 terminates after a maximum number of iterations, max_iter.
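The steps above can be sketched in a few lines of Python. This is a minimal illustration (my own example, not from the original) on the convex function \(\mathcal L(w) = (w - 3)^2\), whose gradient is \(2(w - 3)\):

```python
import random

def gradient_descent(grad, w, eta=0.1, eps=1e-6, max_iter=1000):
    # repeat the gradient step while the gradient norm exceeds the tolerance,
    # but never more than max_iter times
    for _ in range(max_iter):
        g = grad(w)
        if abs(g) <= eps:
            break
        w = w - eta * g
    return w

random.seed(0)
w0 = random.gauss(0, 1)                      # random initialization from N(0, 1)
w = gradient_descent(lambda w: 2 * (w - 3), w0)
print(w)                                     # close to the minimizer w = 3
```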
Adding Momentum#
Purpose of Momentum
When employing gradient descent, several challenges arise:
Becoming stuck at a local minimum, a consequence of the algorithm’s greediness.
Overshooting and overlooking the global optimum due to excessively rapid movement along the gradient direction.
Oscillation, a phenomenon occurring when the function’s value remains relatively constant, resembling navigation on a plateau where the height remains the same regardless of the direction.
To address these issues, a momentum term denoted \(\alpha\) is introduced into the expression for \(\Delta \textbf{w}\); it stabilises the updates while the algorithm approaches the global optimum.
In the following, the superscript i indicates the iteration number:
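The momentum update rule then reads (reconstructed here, since the original formula did not survive extraction; it matches the code used later in this section):

```latex
\Delta \boldsymbol{w}^{(i)} = -\eta \nabla \mathcal{L}\big(\boldsymbol{w}^{(i)}\big)
  + \alpha \, \Delta \boldsymbol{w}^{(i-1)},
\qquad
\boldsymbol{w}^{(i+1)} = \boldsymbol{w}^{(i)} + \Delta \boldsymbol{w}^{(i)}
```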
Gradient Descent for a Univariate Function#
Let’s initialize the objective function and its derivative, and set up the gradient descent function.
Note
The derivative \(f'(x)\) tells us how the function changes at a specific point: it determines both the direction and the speed of the move toward the function’s minimum.
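As a quick illustration (a hedged sketch, not part of the original notebook): taking a small step against the sign of the derivative decreases the function value.

```python
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x

def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5

x = 2.0
step = 1e-3
# a small step against the derivative moves downhill
print(f(x), f(x - step * derivative_f(x)))  # the second value is smaller
```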
gradient_descent performs the optimization using the following parameters:
iterations_limit: maximum number of iterations
threshold: tolerance \(\varepsilon\)
start: initial point for optimization
obj_func: objective function to optimize
derivative_f: derivative of obj_func
learning_rate: step size in the gradient descent update; defaults to 0.05
momentum: momentum factor for update
In the body of the gradient_descent function we set the following variables:
point: current point in the optimization
points: list to store the points visited during optimization
values: list to store the function values corresponding to the points
delta: the update term in the gradient descent formula
i: iteration counter
diff: difference in function values between consecutive iterations
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x


def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5


def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f, learning_rate=0.05, momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(start)]
    delta = 0
    i = 0
    diff = 1.0e10
    while i < iterations_limit and diff > threshold:
        delta = -learning_rate * derivative_f(point) + momentum * delta
        point += delta
        points.append(point)
        values.append(obj_func(point))
        diff = abs(values[-1] - values[-2])
        i += 1
    return points, values
start_point = 2
learning_rate = 0.05
momentum = 0.3
points, values = gradient_descent(100, 0.1, start_point, f, derivative_f, learning_rate, momentum)
The points show the optimization trajectory, representing how the optimization algorithm moves through different points in the search space.
The values correspond to the function values at each optimization point, indicating how well the algorithm is minimizing the objective function over iterations.
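As a sanity check, the final entries of the trajectory should sit near the single stationary point of \(f\), where \(f'(x) \approx 0\). The snippet below is a self-contained re-run repeating the definitions above:

```python
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x

def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5

def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f, learning_rate=0.05, momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(start)]
    delta = 0
    i = 0
    diff = 1.0e10
    while i < iterations_limit and diff > threshold:
        delta = -learning_rate * derivative_f(point) + momentum * delta
        point += delta
        points.append(point)
        values.append(obj_func(point))
        diff = abs(values[-1] - values[-2])
        i += 1
    return points, values

points, values = gradient_descent(100, 0.1, 2, f, derivative_f, 0.05, 0.3)
# the run settles near the single real root of f', around x = -1.66
print(points[-1], values[-1])
```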
The code below defines a set of helper functions for creating interactive 2D and 3D plots using the Plotly library. Each function is designed to generate a specific type of plot or visualization:
plot_function takes arrays x and y representing coordinates and a name for the plot. It creates a line plot using the Scatter class from Plotly, where mode='lines' specifies that it should draw lines connecting the data points.
plot_function_3d is for 3D surface plots. It uses the Surface class from Plotly. It takes arrays x, y, and z for 3D coordinates, along with a colorscale and highlightcolor for customization. The function also configures contours and lighting for the 3D plot.
plot_points_3d generates a 3D scatter plot using the Scatter3d class. It takes arrays x, y, and z for 3D coordinates, a symbol for marker shape, colorscale for color variation, and a name for labeling.
plot_points creates a 2D scatter plot using the Scatter class. Its arguments are similar to those of the plot_points_3d function but without the third-dimension array. The default marker size is set to 8, but it can be customized.
plot_line generates a line plot in 2D using the Scatter class with mode='lines'. It takes arrays x and y for coordinates, color for the line color, and name for labeling.
get_layout creates a layout configuration for 2D plots. It takes parameters such as title, title_x, and title_y for plot titles and axis titles. It also allows for additional annotations and specifies whether the legend should be shown.
get_layout_3d creates a layout configuration for 3D plots. It includes titles for the three axes (title_x, title_y, title_z), camera settings, and margin adjustments.
plot_contour_map_3d generates a 3D contour map using the Contour class. It takes arrays x, y, and z for contour coordinates and a colorscale for color variation.
We can pass the list of optimization points and corresponding function values to the visualization function. The code below defines a gradient_descent_visualization_plot function with inner functions:
get_annotations: function that adds labels to each (x, y) point
build_figure: function that builds and displays the Plotly figure
The objective function is created using NumPy. Then we call gradient_descent_visualization_plot to which we pass gradient descent points and values in order to display the plot.
Gradient Descent for a Bivariate Function#
We apply the same procedure as in the univariate case considered earlier, but now to a function of two variables, because real-world scenarios mostly involve complex relationships between various factors. The computations therefore involve partial derivatives for finding a local minimum.
Note
Partial derivatives tell us how much a function changes with respect to just one variable, while keeping all other variables constant.
def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5


def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4


def derivative_f_y(y):
    return -3 * y ** 2 + 22


def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f_x, derivative_f_y, learning_rate=0.05,
                     momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(*start)]
    x = point[0]
    y = point[1]
    delta_x = 0
    delta_y = 0
    i = 0
    diff = 1.0e10
    while i < iterations_limit and diff > threshold:
        delta_x = -learning_rate * derivative_f_x(x) + momentum * delta_x
        # note: the y-update must reuse delta_y, not delta_x
        delta_y = -learning_rate * derivative_f_y(y) + momentum * delta_y
        x += delta_x
        y += delta_y
        points.append([x, y])
        values.append(obj_func(x, y))
        diff = abs(values[-1] - values[-2])
        i += 1
    return points, values
start_point = [4.5, 2]
learning_rate = 0.05
momentum = 0.5
points, values = gradient_descent(10, 0.01, start_point, f, derivative_f_x, derivative_f_y, learning_rate, momentum)
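As a sanity check (a hedged sketch, not in the original), the analytic partial derivatives can be compared against central finite differences, varying one variable at a time while the other stays fixed:

```python
def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5

def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4

def derivative_f_y(y):
    return -3 * y ** 2 + 22

h = 1e-6
x0, y0 = 4.5, 2.0
# central finite differences: perturb one variable, hold the other constant
numeric_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
numeric_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(numeric_dx, derivative_f_x(x0))  # both close to 65.75
print(numeric_dy, derivative_f_y(y0))  # both close to 10.0
```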
3D plots can visualize a trajectory of the optimization process on a surface for a bivariate function. This can help to see how the algorithm moves toward the minimum.
In the function’s body, we create 1D NumPy arrays x_obj_values and y_obj_values containing 100 evenly spaced values from -10 to 10. These will be used to generate a grid of points for the objective function visualization.
Let’s use np.meshgrid to create a 2D grid from the 1D arrays, evaluate the objective function f(x, y) at each point on the grid, and store the results in z_obj.
We define a function build_figure that takes the objective function grid (obj_x, obj_y, obj_z) and gradient descent data (gd_x, gd_y, gd_z), and then use the plot_function_3d and plot_points_3d functions to create traces for the objective function surface and the gradient descent points.
Below, a 2D contour plot of the previously implemented 3D visualization is drawn. We apply the plot_contour_map_3d and plot_points functions to create traces for the objective function contours and the gradient descent points.
Note
A 2D representation of 3D space makes it easier to interpret the function’s features and the trajectory of the optimization algorithm.
Gradient Descent for a Linear Regression Model#
To implement the gradient descent for linear regression model (see linear regression section) we follow the steps below:
Import testing dataset
Manually set intercept and slope values for a linear regression model
Implement the mean squared error (MSE) (see regression metrics section) function and compute the error value for the model.
from sklearn.model_selection import train_test_split
import pandas as pd

dataset_path = 'datasets/experience_salary.csv'
data = pd.read_csv(dataset_path)

intercept = 1
slope = 2


def linear_regression(x, intercept, slope):
    return intercept + slope * x


def mean_squared_error(x, intercept, slope, y_actual):
    n = len(x)
    total_error = 0
    for i in range(n):
        total_error += (intercept + slope * x[i] - y_actual[i]) ** 2
    return total_error / n


experience = data[['experience']]
salary = data['salary']
X_train, X_test, Y_train, Y_test = train_test_split(experience, salary, test_size=0.2, random_state=42)
X = X_test.values.flatten()
Y = Y_test.values.flatten()
linear_regression_values = [linear_regression(x, intercept, slope) for x in X]
mse = mean_squared_error(X, intercept, slope, Y)
print(f'MSE: {mse}')
MSE: 876.7019407082549
Below is a visualization of the best-fit linear relationship according to our model. The visualization code has a nested function build_figure that accepts four parameters:
x_obj: the x-coordinate data of the testing data
y_obj: the y-coordinate data of the testing data
x_reg: the x-coordinate data of the regression line
y_reg: the y-coordinate data of the regression line
After that, the plot_points and plot_line functions are used to create traces for the testing data points and the regression line. Then we return a go.Figure object with the specified data and layout.
Define a function that computes partial derivatives with respect to the intercept and slope:
We iterate over each data point in the input arrays x and y_actual
We compute the partial derivatives of the mean squared error with respect to the intercept (derivative_intercept) and the slope (derivative_slope) using the formula for the derivative of the mean squared error:
For the intercept: \(\frac{\partial}{\partial \text{intercept}} = 2 (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i])\)
For the slope: \(\frac{\partial}{\partial \text{slope}} = 2 (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i]) \cdot x[i]\)
Return the average derivatives by dividing the accumulated sums by the number of data points n
def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    derivative_intercept = 0
    derivative_slope = 0
    for i in range(n):
        derivative_intercept += 2 * (intercept + slope * x[i] - y_actual[i])
        derivative_slope += 2 * (intercept + slope * x[i] - y_actual[i]) * x[i]
    return derivative_intercept / n, derivative_slope / n
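The averaged derivatives can be spot-checked against a finite difference of the MSE itself. This sketch uses made-up data, repeating the two functions above so it runs standalone:

```python
def mean_squared_error(x, intercept, slope, y_actual):
    n = len(x)
    return sum((intercept + slope * x[i] - y_actual[i]) ** 2 for i in range(n)) / n

def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    d_intercept = sum(2 * (intercept + slope * x[i] - y_actual[i]) for i in range(n)) / n
    d_slope = sum(2 * (intercept + slope * x[i] - y_actual[i]) * x[i] for i in range(n)) / n
    return d_intercept, d_slope

x = [1.0, 2.0, 3.0]          # made-up sample data
y = [2.5, 4.1, 6.0]
b, w, h = 1.0, 2.0, 1e-6
d_b, d_w = derivative_mean_squared_error(x, y, b, w)
# central finite difference of the MSE with respect to the intercept
numeric_d_b = (mean_squared_error(x, b + h, w, y) - mean_squared_error(x, b - h, w, y)) / (2 * h)
print(d_b, numeric_d_b)      # the analytic and numeric derivatives agree
```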
The function below returns updated values of the intercept and slope parameters. The update subtracts the product of the learning rate and the respective partial derivative from the current value of each parameter.
def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for i in range(num_iterations):
        partial_derivative_intercept, partial_derivative_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * partial_derivative_intercept
        slope -= learning_rate * partial_derivative_slope
    return intercept, slope
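To see that these updates really recover the underlying line, here is a self-contained run on synthetic, noise-free data generated from y = 3 + 2x (the data and hyperparameter values are made up for illustration):

```python
def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    d_intercept = sum(2 * (intercept + slope * x[i] - y_actual[i]) for i in range(n)) / n
    d_slope = sum(2 * (intercept + slope * x[i] - y_actual[i]) * x[i] for i in range(n)) / n
    return d_intercept, d_slope

def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for _ in range(num_iterations):
        d_intercept, d_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

x = list(range(10))
y = [3 + 2 * xi for xi in x]    # exact line: intercept 3, slope 2
intercept, slope = gradient_descent(x, y, 0.0, 0.0, 0.01, 5000)
print(intercept, slope)          # close to (3, 2)
```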
To call the gradient_descent function, we first set arbitrary values for the hyperparameters learning_rate and the number of iterations.
In each iteration, the algorithm calculates the gradient of the cost function with respect to each parameter. The gradient points in the direction of the steepest increase in the cost.
The parameters are then updated in the opposite direction of the gradient to move towards the minimum of the mean squared error function. The update is proportional to the learning_rate, a hyperparameter that determines the size of the steps taken in each iteration.
learning_rate = 0.001
iterations = 1000
gd_intercept, gd_slope = gradient_descent(X, Y, intercept, slope, learning_rate, iterations)
print(gd_intercept, gd_slope)
1.9080791607526073 0.9313247403333909
The code below prints the mean squared error and the predicted values for the testing data using the linear_regression model with gradient descent-optimized parameters.
gd_mse = mean_squared_error(X, gd_intercept, gd_slope, Y)
print(f'MSE with gradient descent: {gd_mse}')
gd_linear_regression_values = [linear_regression(x, gd_intercept, gd_slope) for x in X]
MSE with gradient descent: 29.349835951481545
Let’s compare the mean squared error (MSE) result against the built-in linear regression model from the sklearn library.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
print(f'Sklearn mse {mse}')
Sklearn mse 27.65026873284228
Let’s define a function that draws the best-fit line after the optimization has been applied.
The showcases above demonstrate the flexibility and effectiveness of gradient descent in diverse optimization tasks, and along the way we have developed an understanding of what gradient descent is. So, let’s check what we’ve learned from this section by taking the quiz.